Getting Started with SWAT

SWAT (SAS Wrapper for Analytics Transfer) is the Python module used to interface with SAS' Cloud Analytics Services (CAS). This module includes classes and functions to call actions on the server, and process the results on the client. This workshop will cover the basics of connecting to CAS from Python, calling CAS actions, interacting with CAS tables, and using the results of the CAS actions on the client.

We will cover several aspects of working with CAS in the following sections. These include:

Exercise 1 - Connecting to CAS

The SWAT module in Python is merely a CAS client, it does not start the server. So the first thing you need to do is get access to a running CAS server. To connect to the server, we need the hostname and port number that the server is running at as well as a way to authenticate with the server.

There are various ways of authenticating with CAS, so you will need to check with your system administrator as to the proper method to use. The simplest form of authentication is username and password, and we will cover that here.

To connect to CAS, we will use the CAS class in the SWAT package. The most basic form of creating a connection is to supply the CAS host, port, username, and password to the CAS class.


In [16]:
import swat

conn = swat.CAS('sasserver.demo.sas.com', 5570)

We can print the CAS object to see the settings for that session. It includes the host, port, username, protocol, session name, and session ID. The host, port, and username should be self-explanatory. The protocol can be 'cas' for the binary protocol, 'http' for the unencrypted REST interface, or 'https' for the encrypted REST interface.

The session name is simply a human-readable tag given to a session so it is easier to identify in actions that display session information. The session ID is a unique identifier given to each session. The session ID can be used to reconnect to a session, or connect from another client.


In [17]:
conn


Out[17]:
CAS('sasserver.demo.sas.com', 5570, 'viyauser', protocol='cas', name='py-session-2', session='95117b0a-44e6-5c4c-8cf3-aa242c1d8316')

Authinfo Authentication

It's generally not a good idea to put your username and password into a program, so the CAS class also supports Authinfo authentication. This is a method of storing your username and password in a file that is readable only by you. The CAS object can then safely read the username and password from that file. This file was used in the connection above which is why you didn't need to enter a username or password.

The basic format of the file is to put a host, port, username, and password on each line.

host my-cas.com port 12345 user my-name password s3cr3t

The hostname must exactly match the hostname used in the CAS object. If no port number is specified on the line, the username and password will be used for any port on that machine.

By default, the file should be stored in ~/.authinfo, and it should be readable only by you. On Linux, the command to do this is:

chmod 0600 ~/.authinfo

Once you have a file with the appropriate authentication information, you can create a CAS object without specifying a username and password.

Exercise 2 - Running CAS Actions

Now that we have a CAS connection, we can try running some CAS actions. Let's start with the serverstatus action. This action displays some basic status information about tha CAS server such as platform and release information, the number of nodes, and information about each node in the grid.

CAS actions are called on CAS objects just like Python methods. We will store the result of the action call in out as well as display the result.


In [18]:
out = conn.serverstatus()
out


NOTE: Grid node action status report: 1 nodes, 26 total actions executed.
Out[18]:
§ About
{'Copyright': 'Copyright © 2014-2016 SAS Institute Inc. All Rights Reserved.', 'Version': '3.01', 'CAS': 'Cloud Analytic Services', 'license': {'gracePeriod': 45, 'maxCPUs': 9999, 'expires': '29Nov2016:00:00:00', 'siteNum': 70068118, 'site': '28Aug2015 English Only AIX Conf', 'warningPeriod': 55}, 'System': {'Model Number': 'x86_64', 'OS Family': 'LIN X64', 'Hostname': 'sasserver', 'OS Name': 'Linux', 'OS Version': '#1 SMP Tue Jul 12 18:30:56 UTC 2016', 'OS Release': '2.6.32-642.3.1.el6.x86_64'}, 'VersionLong': 'V.03.01M0P08282016'}

§ server
Server Status
nodes actions
0 1 26

§ nodestatus
Node Status
name role uptime running stalled
0 sasserver.demo.sas.com controller 0.16 0 0

elapsed 0.00269s · user 0.001s · sys 0.001s · mem 0.0761MB

The result of every CAS action is a CASResults object. CASResults objects are simply a subclass of Python's ordered dictionary, so anything that you can do with Python dictionaries can also be done with CASResults objects. For example, we can print all of the result keys using the keys method.


In [19]:
list(out.keys())


Out[19]:
['About', 'server', 'nodestatus']

You can also access the objects at each key using dictionary syntax.


In [20]:
out['nodestatus']


Out[20]:
Node Status
name role uptime running stalled
0 sasserver.demo.sas.com controller 0.16 0 0

The types of the objects at each key vary from action to action. In the case of serverstatus, the 'About' key holds a dictionary, and the 'server' and 'nodestatus' keys hold DataFrames.


In [21]:
[type(x) for x in out.values()]


Out[21]:
[dict, swat.dataframe.SASDataFrame, swat.dataframe.SASDataFrame]

As you can see, the DataFrames held in the output are SASDataFrame objects. The SASDataFrame object is a subclass of Pandas DataFrame. This means that they can be treated as equivalents. The SASDataFrame object simply adds some extra SAS metadata such as titles, labels, formats, etc.

Of course, since it is a DataFrame, we can use indexing, sorting, and filtering operations on it.


In [22]:
nodestatus = out['nodestatus']
nodestatus[nodestatus['role'] == 'worker']


Out[22]:
Node Status
name role uptime running stalled

In [23]:
nodestatus[nodestatus['role'] == 'controller']


Out[23]:
Node Status
name role uptime running stalled
0 sasserver.demo.sas.com controller 0.16 0 0

In [24]:
nodestatus = nodestatus.set_index('name')
nodestatus.loc['sasserver.demo.sas.com']


Out[24]:
role       controller
uptime           0.16
running             0
stalled             0
Name: sasserver.demo.sas.com, dtype: object

Now that we've covered the basics of calling CAS actions and traversing the results, let's see what other actions are available.

Exercise 3 - Getting Help on CAS Actions

3.1 Using CAS' help Function

So far we've only introduced the serverstatus action. However, there are many actions that are available in all CAS installations, and various other may be available depending which products have been licensed. To display all available actions, you use the help action. (Note that we added the semi-colon at the end to suppress the rendering of the output which is a DataFrame of the same information that is displayed in the notes.)


In [25]:
conn.help();


NOTE: Available Action Sets and Actions:
NOTE:    accessControl
NOTE:       assumeRole - Assumes a role
NOTE:       dropRole - Relinquishes a role
NOTE:       showRolesIn - Shows the currently active role
NOTE:       showRolesAllowed - Shows the roles that a user is a member of
NOTE:       isInRole - Shows whether a role is assumed
NOTE:       isAuthorized - Shows whether access is authorized
NOTE:       isAuthorizedActions - Shows whether access is authorized to actions
NOTE:       isAuthorizedTables - Shows whether access is authorized to tables
NOTE:       isAuthorizedColumns - Shows whether access is authorized to columns
NOTE:       listAllPrincipals - Lists all principals that have explicit access controls
NOTE:       whatIsEffective - Lists effective access and explanations (Origins)
NOTE:       listAcsData - Lists access controls for caslibs, tables, and columns
NOTE:       listAcsActionSet - Lists access controls for an action or action set
NOTE:       repAllAcsCaslib - Replaces all access controls for a caslib
NOTE:       repAllAcsTable - Replaces all access controls for a table
NOTE:       repAllAcsColumn - Replaces all access controls for a column
NOTE:       repAllAcsActionSet - Replaces all access controls for an action set
NOTE:       repAllAcsAction - Replaces all access controls for an action
NOTE:       updSomeAcsCaslib - Adds, deletes, and modifies some access controls for a caslib
NOTE:       updSomeAcsTable - Adds, deletes, and modifies some access controls for a table
NOTE:       updSomeAcsColumn - Adds, deletes, and modifies some access controls for a column
NOTE:       updSomeAcsActionSet - Adds, deletes, and modifies some access controls for an action set
NOTE:       updSomeAcsAction - Adds, deletes, and modifies some access controls for an action
NOTE:       remAllAcsData - Removes all access controls for a caslib, table, or column
NOTE:       remAllAcsActionSet - Removes all access controls for an action set or action
NOTE:       operTableMd - Adds, deletes, and modifies table metadata
NOTE:       operColumnMd - Adds, deletes, and modifies column metadata
NOTE:       operActionSetMd - Adds, deletes, and modifies action set metadata
NOTE:       operActionMd - Adds, deletes, and modifies action metadata
NOTE:       operAdminMd - Assigns users and groups to roles and modifies administrator metadata
NOTE:       listMetadata - Lists the metadata for caslibs, tables, columns, action sets, actions, or administrators
NOTE:       persistMetadata - Persists the access control metadata
NOTE:       createBackup - Creates a backup if one is not in progress
NOTE:       completeBackup - Flags a backup as complete
NOTE:       operBWPaths - Configures a blacklist or whitelist of paths
NOTE:       deleteBWList - Deletes a blacklist or a whitelist
NOTE:    builtins
NOTE:       addNode - Adds a machine to the server
NOTE:       removeNode - Remove one or more machines from the server
NOTE:       help - Shows the parameters for an action or lists all available actions
NOTE:       listNodes - Shows the host names used by the server
NOTE:       loadActionSet - Loads an action set for use in this session
NOTE:       installActionSet - Loads an action set in new sessions automatically
NOTE:       log - Shows and modifies logging levels
NOTE:       queryActionSet - Shows whether an action set is loaded
NOTE:       queryName - Checks whether a name is an action or action set name
NOTE:       reflect - Shows detailed parameter information for an action or all actions in an action set
NOTE:       serverStatus - Shows the status of the server
NOTE:       about - Shows the status of the server
NOTE:       shutdown - Shuts down the server
NOTE:       getUsers - Shows the users from the authentication provider
NOTE:       getGroups - Shows the groups from the authentication provider
NOTE:       userInfo - Shows the user information for your connection
NOTE:       actionSetInfo - Shows the build information from loaded action sets
NOTE:       history - Shows the actions that were run in this session
NOTE:       casCommon - Provides parameters that are common to many actions
NOTE:       ping - Sends a single request to the server to confirm that the connection is working
NOTE:       echo - Prints the supplied parameters to the client log
NOTE:       modifyQueue - Modifies the action response queue settings
NOTE:       getLicenseInfo - Shows the license information for a SAS product
NOTE:       refreshLicense - Refresh SAS license information from a file
NOTE:       httpAddress - Shows the HTTP address for the server monitor
NOTE:    configuration
NOTE:       getServOpt - displays the value of a server option
NOTE:       listServOpts - Displays the server options and server values
NOTE:    dataPreprocess
NOTE:       rustats - Computes robust univariate statistics, centralized moments, quantiles, and frequency distribution statistics
NOTE:       impute - Performs data matrix (variable) imputation
NOTE:       outlier - Performs outlier detection and treatment
NOTE:       binning - Performs unsupervised variable discretization
NOTE:       discretize - Performs supervised and unsupervised variable discretization
NOTE:       histogram - Generates histogram bins and simple bin-based statistics for numeric variables
NOTE:       transform - Performs pipelined variable imputation, outlier detection and treatment, functional transformation, binning, and robust univariate statistics to evaluate the quality of the transformation
NOTE:       kde - Computes kernel density estimation
NOTE:    dataStep
NOTE:       runCode - Runs DATA step code
NOTE:    percentile
NOTE:       percentile - Percentile
NOTE:       boxPlot - BoxPlot
NOTE:       assess - Assess
NOTE:    search
NOTE:       searchIndex - Searches for a query against an index and retrieves records, documents, and tuples that are relevant to that query
NOTE:       searchAggregate - Aggregates certain fields in a table that is usually generated by searchIndex
NOTE:       valueCount - value count for multiple fields
NOTE:       buildIndex - Creates an empty index using a schema (the first step of Search)
NOTE:       getSchema - Gets the schema of an index
NOTE:       appendIndex - Loads data to an index after the buildIndex action is performed
NOTE:       deleteDocuments - Delete a portion of documents from index
NOTE:    session
NOTE:       listSessions - Displays a list of the sessions on the server
NOTE:       addNodeStatus - Lists details about machines currently being added to the server
NOTE:       timeout - Changes the time-out for a session
NOTE:       endSession - Ends the current session
NOTE:       sessionId - Displays the name and UUID of the current session
NOTE:       sessionName - Changes the name of the current session
NOTE:       sessionStatus - Displays the status of the current session
NOTE:       listresults - Lists the saved results for a session
NOTE:       batchresults - Change current action to batch results
NOTE:       fetchresult - Fetch the specified saved result for a session
NOTE:       flushresult - Flush the saved result for this session
NOTE:       setLocale - Changes the locale for the current session
NOTE:       metrics - Displays the metrics for each action after it executes
NOTE:    sessionProp
NOTE:       setSessOpt - Sets a session option
NOTE:       getSessOpt - Displays the value of a session option
NOTE:       listSessOpts - Displays the session options and session values
NOTE:       addFmtLib - Adds a format library
NOTE:       listFmtLibs - Lists the format libraries that are associated with the session
NOTE:       setFmtSearch - Sets the format libraries to search
NOTE:       listFmtSearch - Shows the format library search order
NOTE:       dropFmtLib - Drops a format library from global scope for all sessions
NOTE:       deleteFormat - Deletes a format from a format library
NOTE:       addFormat - Adds a format to a format library
NOTE:       listFmtValues - Shows the values for a format
NOTE:       saveFmtLib - Saves a format library
NOTE:       promoteFmtLib - Promotes a format library to global scope for all sessions
NOTE:       listFmtRanges - Displays the range information for a format
NOTE:    simple
NOTE:       mdSummary - Calculates multidimensional summaries of numeric variables
NOTE:       numRows - Shows the number of rows in a Cloud Analytic Services table
NOTE:       summary - Generates descriptive statistics of numeric variables such as the sample mean, sample variance, sample size, sum of squares, and so on
NOTE:       correlation - Generates a matrix of Pearson product-moment correlation coefficients
NOTE:       regression - Performs a linear regression up to 3rd-order polynomials
NOTE:       crossTab - Performs one-way or two-way tabulations
NOTE:       distinct - Computes the distinct number of values of the variables in the variable list
NOTE:       topK - Returns the top-K and bottom-K distinct values of each variable included in the variable list based on a user-specified ranking order
NOTE:       groupBy - Builds BY groups in terms of the variable value combinations given the variables in the variable list
NOTE:       freq - Generates a frequency distribution for one or more variables
NOTE:       paraCoord - Generates a parallel coordinates plot of the variables in the variable list
NOTE:    table
NOTE:       view - Creates a view from files or tables
NOTE:       attribute - Manages extended table attributes
NOTE:       upload - Transfers binary data to the server to create objects like tables
NOTE:       loadTable - Loads a table from a caslib's data source
NOTE:       tableExists - Checks whether a table has been loaded
NOTE:       columnInfo - Shows column information
NOTE:       fetch - Fetches rows from a table or view
NOTE:       save - Saves a table to a caslib's data source
NOTE:       addTable - Add a table by sending it from the client to the server
NOTE:       tableInfo - Shows information about a table
NOTE:       tableDetails - Get detailed information about a table
NOTE:       dropTable - Drops a table
NOTE:       deleteSource - Delete a table or file from a caslib's data source
NOTE:       fileInfo - Lists the files in a caslib's data source
NOTE:       promote - Promote a table to global scope
NOTE:       addCaslib - Adds a new caslib to enable access to a data source
NOTE:       dropCaslib - Drops a caslib
NOTE:       caslibInfo - Shows caslib information
NOTE:       queryCaslib - Checks whether a caslib exists
NOTE:       partition - Partitions a table
NOTE:       recordCount - Shows the number of rows in a Cloud Analytic Services table
NOTE:       loadDataSource - Loads one or more data source interfaces
NOTE:       update - Updates rows in a table

You can also ask for help on specifc action sets and actions.


In [26]:
conn.help(actionset='builtins');


NOTE: Information for action set 'builtins':
NOTE:    builtins
NOTE:       addNode - Adds a machine to the server
NOTE:       removeNode - Remove one or more machines from the server
NOTE:       help - Shows the parameters for an action or lists all available actions
NOTE:       listNodes - Shows the host names used by the server
NOTE:       loadActionSet - Loads an action set for use in this session
NOTE:       installActionSet - Loads an action set in new sessions automatically
NOTE:       log - Shows and modifies logging levels
NOTE:       queryActionSet - Shows whether an action set is loaded
NOTE:       queryName - Checks whether a name is an action or action set name
NOTE:       reflect - Shows detailed parameter information for an action or all actions in an action set
NOTE:       serverStatus - Shows the status of the server
NOTE:       about - Shows the status of the server
NOTE:       shutdown - Shuts down the server
NOTE:       getUsers - Shows the users from the authentication provider
NOTE:       getGroups - Shows the groups from the authentication provider
NOTE:       userInfo - Shows the user information for your connection
NOTE:       actionSetInfo - Shows the build information from loaded action sets
NOTE:       history - Shows the actions that were run in this session
NOTE:       casCommon - Provides parameters that are common to many actions
NOTE:       ping - Sends a single request to the server to confirm that the connection is working
NOTE:       echo - Prints the supplied parameters to the client log
NOTE:       modifyQueue - Modifies the action response queue settings
NOTE:       getLicenseInfo - Shows the license information for a SAS product
NOTE:       refreshLicense - Refresh SAS license information from a file
NOTE:       httpAddress - Shows the HTTP address for the server monitor

In [27]:
conn.help(action='tableinfo');


NOTE: Information for action 'table.tableInfo':
NOTE: The following parameters are accepted.  Default values are shown.
NOTE:    string name=NULL (alias: table),
NOTE:       specifies the table name.
NOTE:    string caslib=NULL,
NOTE:       specifies the caslib containing the table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.
NOTE:    boolean quiet=false (alias: silent)
NOTE:       when set to True, attempting to show information for a table that does not exist returns an OK status and severity. When set to False, attempting to show information for a table that does not exist returns an error.

3.2 Using Python's help Function

In addition to the help action in CAS, you can also use Python's help function to retrieve the help content in a Pythonic way. This will display both the help for the CAS action as well as information about the Python object driving the action.


In [28]:
help(conn.tableinfo)


Help on table.Tableinfo in module swat.cas.actions object:

class table.Tableinfo(CASAction)
 |  Shows information about a table
 |  
 |  Parameters
 |  ----------
 |  name : string or CASTable, optional
 |      specifies the table name.
 |  
 |  caslib : string, optional
 |      specifies the caslib containing the table that you want to use with
 |      the action. By default, the active caslib is used. Specify a value
 |      only if you need to access a table from a different caslib.
 |  
 |  quiet : boolean, optional
 |      when set to True, attempting to show information for a table that
 |      does not exist returns an OK status and severity. When set to False,
 |      attempting to show information for a table that does not exist
 |      returns an error.
 |      Default: False
 |  
 |  Returns
 |  -------
 |  Tableinfo object
 |  
 |  Method resolution order:
 |      table.Tableinfo
 |      CASAction
 |      swat.cas.utils.params.ParamManager
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __call__(_self_, name=None, caslib=None, quiet=None, **kwargs)
 |      Shows information about a table
 |      
 |      Parameters
 |      ----------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      CASResults object
 |  
 |  __init__(_self_, name=None, caslib=None, quiet=None, **kwargs)
 |      Shows information about a table
 |      
 |      Parameters
 |      ----------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      Tableinfo object
 |  
 |  get_param(_self_, key)
 |      Get the value of an action parameter
 |      
 |      Parameters
 |      ----------
 |      key : string
 |          The fully-qualified name (e.g., table.name) of the parameter to retrieve.
 |      
 |      Valid Parameters
 |      ----------------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      any
 |          The value of the speciifed parameter.
 |  
 |  get_params(_self_, *keys)
 |      Get the value of one or more action parameters
 |      
 |      Parameters
 |      ----------
 |      *keys : one or more strings
 |          The fully-qualified names (e.g., table.name) of the parameters to retrieve.
 |      
 |      Valid Parameters
 |      ----------------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      dict
 |          A dictionary of key value pairs containing the requested parameters.
 |  
 |  set_param(_self_, *args, **kwargs)
 |      Set one or more action parameters
 |      
 |      Parameters
 |      ----------
 |      *args : string / any pairs, optional
 |          Parameters can be specified as fully-qualified names (e.g, table.name)
 |          and values as subsequent arguments.  Any number of name / any pairs
 |          can be specified.
 |      **kwargs : any, optional
 |          Parameters can be specified as any number of keyword arguments.
 |      
 |      Examples
 |      --------
 |      #
 |      # String / any pairs
 |      #
 |      > summ = s.simple.Sumamry()
 |      > summ.set_param('table.name', 'iris',
 |                       'table.singlepass', True,
 |                       'casout.name', 'iris_summary')
 |      > print(summ)
 |      ?.simple.Summary(table={'name': 'iris', 'singlepass': True},
 |                       casout={'name': 'iris_summary'})
 |      
 |      #
 |      # Keywords
 |      #
 |      > summ.set_param(casout=dict(name='iris_out'))
 |      > print(summ)
 |      ?.simple.Summary(table={'name': 'iris', 'singlepass': True},
 |                       casout={'name': 'iris_out'})
 |      
 |      Valid Parameters
 |      ----------------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  set_params(_self_, *args, **kwargs)
 |      Set one or more action parameters
 |      
 |      Parameters
 |      ----------
 |      *args : string / any pairs, optional
 |          Parameters can be specified as fully-qualified names (e.g, table.name)
 |          and values as subsequent arguments.  Any number of name / any pairs
 |          can be specified.
 |      **kwargs : any, optional
 |          Parameters can be specified as any number of keyword arguments.
 |      
 |      Examples
 |      --------
 |      #
 |      # String / any pairs
 |      #
 |      > summ = s.simple.Sumamry()
 |      > summ.set_param('table.name', 'iris',
 |                       'table.singlepass', True,
 |                       'casout.name', 'iris_summary')
 |      > print(summ)
 |      ?.simple.Summary(table={'name': 'iris', 'singlepass': True},
 |                       casout={'name': 'iris_summary'})
 |      
 |      #
 |      # Keywords
 |      #
 |      > summ.set_param(casout=dict(name='iris_out'))
 |      > print(summ)
 |      ?.simple.Summary(table={'name': 'iris', 'singlepass': True},
 |                       casout={'name': 'iris_out'})
 |      
 |      Valid Parameters
 |      ----------------
 |      name : string or CASTable, optional
 |          specifies the table name.
 |      
 |      caslib : string, optional
 |          specifies the caslib containing the table that you want to use with
 |          the action. By default, the active caslib is used. Specify a value
 |          only if you need to access a table from a different caslib.
 |      
 |      quiet : boolean, optional
 |          when set to True, attempting to show information for a table that
 |          does not exist returns an OK status and severity. When set to False,
 |          attempting to show information for a table that does not exist
 |          returns an error.
 |          Default: False
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  all_params = {'caslib', 'name', 'quiet'}
 |  
 |  param_names = ['name', 'caslib', 'quiet']
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from CASAction:
 |  
 |  __iter__(self)
 |      Call the action and iterate over the results
 |  
 |  invoke(self, **kwargs)
 |      Invoke the action
 |      
 |      Parameters
 |      ----------
 |      **kwargs : any, optional
 |          Arbitrary key/value pairs to add to the arguments sent to the
 |          action.  These key/value pairs are not added to the collection
 |          of parameters set on the action object.  They are only used in
 |          this call.
 |      
 |      Returns
 |      -------
 |      self
 |          Returns the CASAction object itself
 |  
 |  retrieve = __call__(self, **kwargs)
 |      Call the action
 |      
 |      Parameters
 |      ----------
 |      **kwargs : any, optional
 |          Arbitrary key/value pairs to add to the arguments sent to the
 |          action.  These key/value pairs are not added to the collection
 |          of parameters set on the action object.  They are only used in
 |          this call.
 |      
 |      Returns
 |      -------
 |      CASResults object
 |          Collection of results from the action call
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from CASAction:
 |  
 |  from_reflection(asname, actinfo, connection) from builtins.type
 |      Construct a CASAction class from reflection information
 |      
 |      Parameters
 |      ----------
 |      asname : string
 |          The action set name
 |      actinfo : dict
 |          The reflection information for the action
 |      connection : CAS object
 |          The connection to associate with the CASAction
 |      defaults : dict
 |          Default parameters for the action
 |      
 |      Returns
 |      -------
 |      CASAction class
 |  
 |  get_connection() from builtins.type
 |      Return the registered connection
 |      
 |      The connection is only held by a weak reference.  If the
 |      connection no longer exists, a SWATError is raised.
 |      
 |      Raises
 |      ------
 |      SWATError
 |          If the registered connection no longer exists
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from CASAction:
 |  
 |  trait_names = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from swat.cas.utils.params.ParamManager:
 |  
 |  __delattr__(self, name)
 |      Delete an attribute
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, type, value, traceback)
 |  
 |  __getattr__(self, name)
 |      Get named attribute
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __setattr__(self, name, value)
 |      Set an attribute
 |  
 |  __str__(self)
 |      Return str(self).
 |  
 |  del_param = del_params(self, *keys)
 |      Delete parameters
 |      
 |      Parameters
 |      ----------
 |      *keys : strings
 |         Names of parameters to delete
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  del_params(self, *keys)
 |      Delete parameters
 |      
 |      Parameters
 |      ----------
 |      *keys : strings
 |         Names of parameters to delete
 |      
 |      Returns
 |      -------
 |      None
 |  
 |  has_param = has_params(self, *keys)
 |      Return a boolean indicating whether or not the parameters exist
 |      
 |      Parameters
 |      ----------
 |      *keys : one or more strings
 |          Names of parameters
 |      
 |      Returns
 |      -------
 |      True or False
 |  
 |  has_params(self, *keys)
 |      Return a boolean indicating whether or not the parameters exist
 |      
 |      Parameters
 |      ----------
 |      *keys : one or more strings
 |          Names of parameters
 |      
 |      Returns
 |      -------
 |      True or False
 |  
 |  to_dict(self)
 |      Return the parameters as a dictionary
 |  
 |  to_json(self, *args, **kwargs)
 |      Convert parameters to JSON
 |      
 |      Parameters
 |      ----------
 |      *args : any, optional
 |          Additional arguments to json.dumps
 |      **kwargs : any, optional
 |          Additional arguments to json.dumps
 |      
 |      Returns
 |      -------
 |      string
 |  
 |  to_params = to_dict(self)
 |      Return the parameters as a dictionary
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from swat.cas.utils.params.ParamManager:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

3.3 Using IPython's or Jupyter's Help Functionality

If you are using IPython or a Jupyter notebook, it has another variant of the help information stored in Python objects. To invoke IPython/Jupyter's help, you simply append a ? to the class/method/function you want help for.


In [29]:
conn.tableinfo?

3.4 Tab Completion

While it doesn't display true help information, tab completion of action names is also available in IPython and Jupyter. All currently loaded action sets and actions will show up in the tab completion for a CAS connection object.


In [30]:
#conn.

Exercise 4 - Loading Data into CAS

So far we've run a few CAS actions and taken a look at the output that they create, but we haven't done anything that requires data. In order to do that, we need to know how to load data into a CAS table.

There are many ways of getting data into CAS. We'll look at a couple of them here.

4.1 Loading Client-Side Data into CAS

The most Pythonic way of getting data into CAS is to use the Pandas-style data readers on the CAS connection object. When we refer to "client side" data, we mean data that is located on the machine that Python is running on. "Server side" data is data that is on the machine that CAS is running on. These may or may not be the same machine.

One thing to keep in mind when loading client-side data is that all data must be uploaded to the CAS machine. This may be fine for small to medium sized data sets, but for larger data sets you would probably want to move on to the section 4.3 on loading server-side data.

One of the simplest data file formats is CSV. Pandas' read_csv function is a very capable CSV reader with many options. You can use the SWAT version of that same function on the CAS connection object. It supports all of the same options as the Pandas version (and actually uses it in the background). The difference is that the SWAT version returns a CASTable object rather than a DataFrame.


In [31]:
tbl = conn.read_csv('https://raw.githubusercontent.com/'
                    'sassoftware/sas-viya-programming/master/data/cars.csv')
tbl


Out[31]:
CASTable('_T_8B910465_7F1F62C95F88', caslib='CASUSER(viyauser)')

Since the code above uses a Python data parser, the content of the URL must be downloaded to the client machine first, then it gets parsed. The data is then uploaded to CAS. You may want to keep this in mind since it does mean that the data is transferred twice. If you have larger data files, you may want to use an alternative way of loading them.

A CASTable object is a Python object interface to a CAS table. You can call CAS actions on it just like a CAS connection object. It also supports much of the Pandas DataFrame API. This means that you can treat it like a DataFrame, but all of the operations will be performed on the CAS server using CAS actions.

The first thing we'll do is take a look at the column information using the columninfo CAS action.


In [32]:
conn.columninfo(table=tbl)


Out[32]:
§ ColumnInfo
Column ID Type RawLength FormattedLength NFL NFD
0 Make 1 varchar 13 13 0 0
1 Model 2 varchar 39 39 0 0
2 Type 3 varchar 6 6 0 0
3 Origin 4 varchar 6 6 0 0
4 DriveTrain 5 varchar 5 5 0 0
5 MSRP 6 double 8 12 0 0
6 Invoice 7 double 8 12 0 0
7 EngineSize 8 double 8 12 0 0
8 Cylinders 9 double 8 12 0 0
9 Horsepower 10 double 8 12 0 0
10 MPG_City 11 double 8 12 0 0
11 MPG_Highway 12 double 8 12 0 0
12 Weight 13 double 8 12 0 0
13 Wheelbase 14 double 8 12 0 0
14 Length 15 double 8 12 0 0

elapsed 0.00065s · mem 0.132MB

We can also fetch a sample of the data using the fetch CAS action.


In [33]:
conn.fetch(table=tbl, to=5)


Out[33]:
§ Fetch
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

elapsed 0.0065s · user 0.002s · mem 0.569MB

Calling Actions from CASTable Objects

As you can see in the previous two code samples, we ran the CAS action on the connection object and supplied the table in the table= parameter. Since this is such a common thing to do, the CASTable object can also be used as the action dispatcher. In this case, you simply call the action on the CASTable object itself, and don't specify a table= parameter. The table= parameter will be automatically populated with the table information.


In [34]:
tbl.fetch(to=5)


Out[34]:
§ Fetch
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

elapsed 0.00218s · user 0.002s · mem 0.569MB

4.2 Parsing Client-Side Data on the Server

Another option for processing client-side data is to upload it as-is and let CAS parse it. This is generally faster since there isn't the additional overhead of converting Python objects to data in a form that CAS can read it. In addition, some file formats such as CSV can be parsed in parallel in CAS which gives another performance boost.

The example below uploads the same file as above, but this time CAS is doing the parsing. You should note that since this CSV file is a URL, the data must be downloaded to the client machine and then uploaded to CAS which adds a bit more overhead.


In [35]:
tbl2 = conn.upload_file('https://raw.githubusercontent.com/'
                        'sassoftware/sas-viya-programming/master/data/cars.csv')


NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(viyauser).
NOTE: The table CARS has been created in caslib CASUSER(viyauser) from binary data uploaded to Cloud Analytic Services.

In [36]:
tbl2.fetch(to=5)


Out[36]:
§ Fetch
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

elapsed 0.00138s · user 0.002s · mem 0.567MB

There is also an upload_frame method on the CAS connection object that will upload a Pandas DataFrame. This method simply exports the DataFrame to CSV and uploads that file to the server.

4.3 Loading Server-Side Data into CAS

If you have larger data sets, you will likely want to keep them on the same machine as your CAS server so that they can be loaded directly. The action used to load data from a file directly is loadtable.


In [37]:
conn.caslibinfo()


Out[37]:
§ CASLibInfo
Name Type Description Path Definition Subdirs Local Active Personal
0 CASUSER(viyauser) PATH Personal File System Caslib /home/viyauser/ 1.0 0.0 1.0 1.0
1 Formats PATH CAS Formats caslib /opt/sas/viya/config/data/cas/default/formats/ 1.0 0.0 0.0 0.0
2 Public PATH CAS Public caslib /opt/sas/viya/config/data/cas/default/public/ 1.0 0.0 0.0 0.0

elapsed 0.00229s · sys 0.002s · mem 0.055MB


In [38]:
conn.fileinfo(caslib='casuser')


Out[38]:
§ FileInfo
Permission Owner Group Name Size Encryption Time
0 drwxr-xr-x viyauser sas Output 4096 17Jun2016:17:17:27
1 drwxr-xr-x viyauser sas Desktop 4096 30Aug2016:08:28:53
2 drwxrwxr-x viyauser sas Examples 4096 08Jun2016:12:10:03
3 drwxr-xr-x viyauser sas sasuser.viya 4096 14Jul2016:15:14:07
4 -rwxr-xr-x viyauser sas othermycars.sashdat 76848 NONE 31Aug2016:15:10:49
5 drwxr-xr-x viyauser sas Data 4096 01Sep2016:19:34:34
6 drwxr-xr-x viyauser sas Examples_Python 4096 09Sep2016:13:45:01

elapsed 0.00914s · user 0.001s · sys 0.001s · mem 0.0788MB


In [39]:
conn.fileinfo(caslib='casuser', path='Data')


Out[39]:
§ FileInfo
Permission Owner Group Name Size Encryption Time
0 -rw-r--r-- viyauser sas cars.csv 42177 01Sep2016:19:19:21

elapsed 0.00139s · user 0.001s · mem 0.0606MB


In [40]:
out = conn.loadtable(path='Data/cars.csv', caslib='casuser')
out


NOTE: Cloud Analytic Services made the file Data/cars.csv available as table DATA.CARS in caslib CASUSER(viyauser).
Out[40]:
§ caslib
CASUSER(viyauser)

§ tableName
DATA.CARS

§ casTable
CASTable('DATA.CARS', caslib='CASUSER(viyauser)')

elapsed 0.00323s · user 0.005s · mem 8.35MB

You'll notice that we specified both path= and caslib= parameters. Files are not loaded using an absolute file name. They are always relative to a CASLib. CASlibs can be configured to point to file system directories, databases, and other data sources. They can also be used to set access controls for various operations. We won't go into those details here. In the case of the example above, the file was located relative to the casuser CASLib which points to the user's home directory.

The result of the loadtable action is a CASResults object containing three keys: 'caslib', 'tableName', and 'casTable'. The 'casTable' key is simply a CASTable object constructed from the 'caslib' and 'tableName' keys.


In [41]:
tbl3 = out['casTable']
tbl3


Out[41]:
CASTable('DATA.CARS', caslib='CASUSER(viyauser)')

In [42]:
tbl3.fetch(to=5)


Out[42]:
§ Fetch
Selected Rows from Table DATA.CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

elapsed 0.00127s · user 0.001s · mem 0.568MB

Since we commonly just want the CASTable object from the loadtable action, there is a wrapper method on the CAS connection object that calls loadtable behind-the-scenes and just returns the 'casTable' key.

In this case, we'll load the same data file but we'll also specify replace=True in the casout= parameter since it will attempt to create the same output table name.


In [43]:
tbl4 = conn.load_path('Data/cars.csv', caslib='casuser', casout=dict(replace=True))
tbl4


Out[43]:
CASTable('DATA.CARS', caslib='CASUSER(viyauser)')

In [44]:
tbl4.fetch(to=5)


Out[44]:
§ Fetch
Selected Rows from Table DATA.CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

elapsed 0.00135s · user 0.001s · mem 0.569MB

Exercise 5 - Using CASTable as a DataFrame

We have shown a few examples of how the CASTable object acts like a Pandas DataFrame, but the similarities go much further. We have attempted to adopt as much of the DataFrame API as is relevant to CAS. When a method is implemented, both the method parameters and outputs work the same way between the CASTable object and Pandas DataFrame objects. Pandas DataFrame method options are mapped to CAS action options, and the CAS action outputs are transformed to match the form of the Pandas DataFrame outputs. If you are familiar with Pandas DataFrames, this makes it much easier to get started with CAS.

5.1 Subsetting Columns

The most basic operation when dealing with DataFrames and CASTable objects is selecting columns. This is done using the [...] syntax in Python. We'll start with a Pandas DataFrame and a CASTable object that both contain the CARS data set.


In [45]:
import pandas as pd

In [46]:
df = pd.read_csv('https://raw.githubusercontent.com/'
                 'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()


Out[46]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

Here is the same data set as a CASTable.


In [47]:
tbl.head()


Out[47]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

To select a subset of columns using a Pandas DataFrame, you use Python's indexing syntax. When you specify a list of column names, you get a new DataFrame with just those columns in it.


In [48]:
subdf = df[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subdf.head()


Out[48]:
Make Model MSRP Cylinders Horsepower
0 Acura MDX 36945.0 6.0 265.0
1 Acura RSX Type S 2dr 23820.0 4.0 200.0
2 Acura TSX 4dr 26990.0 4.0 200.0
3 Acura TL 4dr 33195.0 6.0 270.0
4 Acura 3.5 RL 4dr 43755.0 6.0 225.0

The same is true of CASTable objects.


In [49]:
subtbl = tbl[['Make', 'Model', 'MSRP', 'Cylinders', 'Horsepower']]
subtbl.head()


Out[49]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model MSRP Cylinders Horsepower
0 Acura MDX 36945.0 6.0 265.0
1 Acura RSX Type S 2dr 23820.0 4.0 200.0
2 Acura TSX 4dr 26990.0 4.0 200.0
3 Acura TL 4dr 33195.0 6.0 270.0
4 Acura 3.5 RL 4dr 43755.0 6.0 225.0

5.2 Sorting Results

You may or may not have gotten the same rows of data when displaying data from the Pandas DataFrame and the CASTable object. This is because CAS does not store data rows in any specified order. CAS tables can be distributed across a grid of machines, so depending on events that may happen, the rows of the table may be redistributed.

However, CASTable objects can emulate Pandas DataFrame-like sorting features by supplying sort parameters to the fetch action when data is retrieved.

Again, we'll show how sorting works with Pandas DataFrames, then show the CASTable version.


In [50]:
subdf.sort_values(['MSRP']).head()


Out[50]:
Make Model MSRP Cylinders Horsepower
206 Kia Rio 4dr manual 10280.0 4.0 104.0
168 Hyundai Accent 2dr hatch 10539.0 4.0 103.0
382 Toyota Echo 2dr manual 10760.0 4.0 108.0
345 Saturn Ion1 4dr 10995.0 4.0 140.0
207 Kia Rio 4dr auto 11155.0 4.0 104.0

In [51]:
subtbl.sort_values(['MSRP']).head()


Out[51]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model MSRP Cylinders Horsepower
0 Kia Rio 4dr manual 10280.0 4.0 104.0
1 Hyundai Accent 2dr hatch 10539.0 4.0 103.0
2 Toyota Echo 2dr manual 10760.0 4.0 108.0
3 Saturn Ion1 4dr 10995.0 4.0 140.0
4 Kia Rio 4dr auto 11155.0 4.0 104.0

You'll see that we get the same rows, but not the same index values when sorting. CAS currently does not have a row index, so it gets the automatically applied sort index when the DataFrame is created on the client side.

The sort_values method of CASTable also support the ascending= parameter.


In [52]:
subtbl.sort_values(['MSRP'], ascending=False).head()


Out[52]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model MSRP Cylinders Horsepower
0 Porsche 911 GT2 2dr 192465.0 6.0 477.0
1 Mercedes-Benz CL600 2dr 128420.0 12.0 493.0
2 Mercedes-Benz SL600 convertible 2dr 126670.0 12.0 493.0
3 Mercedes-Benz SL55 AMG 2dr 121770.0 8.0 493.0
4 Mercedes-Benz CL500 2dr 94820.0 8.0 302.0

5.3 Pandas-style Statistics

While there are CAS actions that give results similar to what Pandas DataFrame statistical methods give, they always require some transformation to be exactly the same form. Many of the Pandas DataFrame methods have been re-implemented in CASTables. This means that they look and act like Pandas DataFrame methods, but all of the real work is done behind-the-scenes in CAS actions.

Probably the most common statistical method in Pandas DataFrames is describe.


In [53]:
df.describe()


/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[53]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20334.250000 18866.000000 2.375000 NaN 165.000000 17.000000 24.000000 3104.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 NaN 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39205.000000 35710.250000 3.900000 NaN 255.000000 21.250000 29.000000 3977.750000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

CASTable objects also support the describe method. As you can see, the column and row indexes are exactly the same as what you get with the Pandas DataFrame method.


In [54]:
tbl.describe()


Out[54]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20329.500000 18851.000000 2.350000 4.000000 165.000000 17.000000 24.000000 3103.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 6.000000 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39215.000000 35732.500000 3.900000 6.000000 255.000000 21.500000 29.000000 3978.500000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

You can also specify the same options to the CASTable.describe method as DataFrames, as well as using the subsetting syntax described above.


In [55]:
tbl[['MSRP', 'EngineSize', 'Cylinders']].describe(percentiles=[0.3, 0.7])


Out[55]:
MSRP EngineSize Cylinders
count 428.000000 428.000000 426.000000
mean 32774.855140 3.196729 5.807512
std 19431.716674 1.108595 1.558443
min 10280.000000 1.300000 3.000000
30% 22000.000000 2.400000 4.000000
50% 27635.000000 3.000000 6.000000
70% 35940.000000 3.600000 6.000000
max 192465.000000 8.300000 12.000000

While the output was created to be compatible with Pandas DataFrame's describe method, CAS supplies many more statistics that can be displayed using the stats= parameter. You can specify a list of names of the statistics to display, or simply specify stats=all to display all of them.


In [56]:
tbl.describe(stats='all')


Out[56]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 4.280000e+02 4.280000e+02 4.280000e+02 4.260000e+02 4.280000e+02 4.280000e+02 4.280000e+02 4.280000e+02 4.280000e+02 4.280000e+02
unique 4.100000e+02 4.250000e+02 4.300000e+01 7.000000e+00 1.100000e+02 2.800000e+01 3.300000e+01 3.480000e+02 4.000000e+01 6.700000e+01
mean 3.277486e+04 3.001470e+04 3.196729e+00 5.807512e+00 2.158855e+02 2.006075e+01 2.684346e+01 3.577953e+03 1.081542e+02 1.863621e+02
std 1.943172e+04 1.764212e+04 1.108595e+00 1.558443e+00 7.183603e+01 5.238218e+00 5.741201e+00 7.589832e+02 8.311813e+00 1.435799e+01
min 1.028000e+04 9.875000e+03 1.300000e+00 3.000000e+00 7.300000e+01 1.000000e+01 1.200000e+01 1.850000e+03 8.900000e+01 1.430000e+02
25% 2.032950e+04 1.885100e+04 2.350000e+00 4.000000e+00 1.650000e+02 1.700000e+01 2.400000e+01 3.103000e+03 1.030000e+02 1.780000e+02
50% 2.763500e+04 2.529450e+04 3.000000e+00 6.000000e+00 2.100000e+02 1.900000e+01 2.600000e+01 3.474500e+03 1.070000e+02 1.870000e+02
75% 3.921500e+04 3.573250e+04 3.900000e+00 6.000000e+00 2.550000e+02 2.150000e+01 2.900000e+01 3.978500e+03 1.120000e+02 1.940000e+02
max 1.924650e+05 1.735600e+05 8.300000e+00 1.200000e+01 5.000000e+02 6.000000e+01 6.600000e+01 7.190000e+03 1.440000e+02 2.380000e+02
nmiss 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
sum 1.402764e+07 1.284629e+07 1.368200e+03 2.474000e+03 9.239900e+04 8.586000e+03 1.148900e+04 1.531364e+06 4.629000e+04 7.976300e+04
stderr 9.392675e+02 8.527639e+02 5.358595e-02 7.550679e-02 3.472326e+00 2.531988e-01 2.775114e-01 3.668684e+01 4.017666e-01 6.940197e-01
var 3.775916e+08 3.112443e+08 1.228982e+00 2.428743e+00 5.160415e+03 2.743892e+01 3.296139e+01 5.760555e+05 6.908624e+01 2.061519e+02
uss 6.209854e+11 5.184789e+11 4.898540e+03 1.540000e+04 2.215110e+07 1.839580e+05 3.224790e+05 5.725125e+09 5.035958e+06 1.495283e+07
cv 5.928849e+01 5.877826e+01 3.467903e+01 2.683495e+01 3.327506e+01 2.611178e+01 2.138771e+01 2.121278e+01 7.685150e+00 7.704349e+00
tvalue 3.489406e+01 3.519696e+01 5.965611e+01 7.691377e+01 6.217318e+01 7.922923e+01 9.672920e+01 9.752689e+01 2.691966e+02 2.685257e+02
probt 4.160412e-127 2.684398e-128 3.133745e-209 1.515569e-251 4.185344e-216 1.866284e-257 1.665621e-292 5.812547e-294 0.000000e+00 0.000000e+00

If you want a full-blown table with all statistics and all variables (numeric and character), you can specify include='all' and stats='all' as paramaters.


In [57]:
tbl.describe(include='all', stats='all')


Out[57]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428 428 428 428 428 428 428 428 426 428 428 428 428 428 428
unique 38 425 6 3 3 410 425 43 7 110 28 33 348 40 67
top Toyota G35 4dr Sedan Asia Front 74995 68306 3 6 200 18 26 3450 107 178
freq 28 2 262 158 226 2 2 42 190 17 69 54 4 45 27
mean NaN NaN NaN NaN NaN 32774.9 30014.7 3.19673 5.80751 215.886 20.0607 26.8435 3577.95 108.154 186.362
std NaN NaN NaN NaN NaN 19431.7 17642.1 1.10859 1.55844 71.836 5.23822 5.7412 758.983 8.31181 14.358
min Acura 3.5 RL 4dr Hybrid Asia All 10280 9875 1.3 3 73 10 12 1850 89 143
25% NaN NaN NaN NaN NaN 20329.5 18851 2.35 4 165 17 24 3103 103 178
50% NaN NaN NaN NaN NaN 27635 25294.5 3 6 210 19 26 3474.5 107 187
75% NaN NaN NaN NaN NaN 39215 35732.5 3.9 6 255 21.5 29 3978.5 112 194
max Volvo Z4 convertible 3.0i 2dr Wagon USA Rear 192465 173560 8.3 12 500 60 66 7190 144 238
nmiss 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
sum NaN NaN NaN NaN NaN 1.40276e+07 1.28463e+07 1368.2 2474 92399 8586 11489 1.53136e+06 46290 79763
stderr NaN NaN NaN NaN NaN 939.267 852.764 0.0535859 0.0755068 3.47233 0.253199 0.277511 36.6868 0.401767 0.69402
var NaN NaN NaN NaN NaN 3.77592e+08 3.11244e+08 1.22898 2.42874 5160.42 27.4389 32.9614 576056 69.0862 206.152
uss NaN NaN NaN NaN NaN 6.20985e+11 5.18479e+11 4898.54 15400 2.21511e+07 183958 322479 5.72512e+09 5.03596e+06 1.49528e+07
cv NaN NaN NaN NaN NaN 59.2885 58.7783 34.679 26.8349 33.2751 26.1118 21.3877 21.2128 7.68515 7.70435
tvalue NaN NaN NaN NaN NaN 34.8941 35.197 59.6561 76.9138 62.1732 79.2292 96.7292 97.5269 269.197 268.526
probt NaN NaN NaN NaN NaN 4.16041e-127 2.6844e-128 3.13374e-209 1.51557e-251 4.18534e-216 1.86628e-257 1.66562e-292 5.81255e-294 0 0

As we showed earlier, you can specify By groups using the groupby method on both DataFrames and CASTables. This also works in conjunction with DataFrame methods like describe.


In [58]:
df.groupby('Origin').describe()


/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[58]:
Cylinders EngineSize Horsepower Invoice Length MPG_City MPG_Highway MSRP Weight Wheelbase
Origin
Asia count 156.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000
mean 5.185897 2.774051 190.702532 22602.177215 182.816456 22.012658 28.265823 24741.322785 3319.316456 105.886076
std 1.269008 0.902310 59.392627 9842.984880 12.564148 6.733307 6.770503 11321.069675 717.842132 7.735249
min 3.000000 1.300000 73.000000 9875.000000 153.000000 13.000000 17.000000 10280.000000 1850.000000 89.000000
25% NaN 2.000000 142.000000 16271.500000 175.000000 18.000000 25.000000 17208.000000 2751.500000 102.000000
50% NaN 2.600000 187.500000 20949.500000 183.500000 20.500000 27.000000 23032.500000 3280.000000 105.000000
75% NaN 3.500000 233.750000 26645.000000 191.000000 24.000000 31.000000 28787.500000 3706.750000 110.000000
max 8.000000 5.600000 340.000000 79978.000000 224.000000 60.000000 66.000000 89765.000000 5590.000000 140.000000
Europe count 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000
mean 6.235772 3.206504 251.894309 44395.081301 181.845528 18.731707 26.008130 48349.796748 3680.723577 106.447154
std 1.679585 1.037157 80.738884 23080.368888 11.479870 3.289509 4.167588 25318.600464 579.576871 6.690099
min 4.000000 1.600000 100.000000 15437.000000 143.000000 12.000000 14.000000 16999.000000 2524.000000 93.000000
25% 5.000000 2.500000 195.500000 31287.500000 177.000000 17.000000 24.000000 33837.500000 3285.000000 102.000000
50% 6.000000 3.000000 225.000000 37575.000000 182.000000 19.000000 26.000000 40590.000000 3585.000000 107.000000
75% 8.000000 4.200000 302.000000 51042.500000 190.000000 20.000000 28.500000 56382.500000 3959.500000 110.000000
max 12.000000 6.000000 493.000000 173560.000000 204.000000 38.000000 46.000000 192465.000000 5423.000000 123.000000
USA count 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000
mean 6.108844 3.642857 212.823129 25949.340136 193.952381 19.074830 26.013605 28377.442177 3769.952381 112.020408
std 1.531090 1.194221 63.748618 10518.722194 15.305265 3.982992 5.396582 11711.982506 855.305524 8.788590
min 4.000000 1.600000 103.000000 10319.000000 150.000000 10.000000 12.000000 10995.000000 2348.000000 93.000000
25% 5.000000 2.700000 160.000000 18927.000000 185.000000 17.000000 22.000000 20315.000000 3234.000000 106.000000
50% 6.000000 3.600000 200.000000 23217.000000 194.000000 18.000000 26.000000 25520.000000 3606.000000 111.000000
75% 8.000000 4.600000 250.000000 30778.000000 201.500000 21.000000 29.000000 33767.500000 4235.000000 115.000000
max 10.000000 8.300000 500.000000 74451.000000 238.000000 29.000000 37.000000 81795.000000 7190.000000 144.000000

Here is the CASTable version. Notice that in this case, since compatibility with Pandas was the primary concern, the By groups are all contained in the same DataFrame. Calling CAS actions directly will always split them into separate tables.


In [59]:
tbl.groupby('Origin').describe()


Out[59]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Origin
Asia count 158.000000 158.000000 158.000000 156.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000
mean 24741.322785 22602.177215 2.774051 5.185897 190.702532 22.012658 28.265823 3319.316456 105.886076 182.816456
std 11321.069675 9842.984880 0.902310 1.269008 59.392627 6.733307 6.770503 717.842132 7.735249 12.564148
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 13.000000 17.000000 1850.000000 89.000000 153.000000
25% 17200.000000 16265.000000 2.000000 4.000000 142.000000 18.000000 25.000000 2750.000000 102.000000 175.000000
50% 23032.500000 20949.500000 2.600000 6.000000 187.500000 20.500000 27.000000 3280.000000 105.000000 183.500000
75% 28800.000000 26660.000000 3.500000 6.000000 235.000000 24.000000 31.000000 3715.000000 110.000000 191.000000
max 89765.000000 79978.000000 5.600000 8.000000 340.000000 60.000000 66.000000 5590.000000 140.000000 224.000000
Europe count 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000
mean 48349.796748 44395.081301 3.206504 6.235772 251.894309 18.731707 26.008130 3680.723577 106.447154 181.845528
std 25318.600464 23080.368888 1.037157 1.679585 80.738884 3.289509 4.167588 579.576871 6.690099 11.479870
min 16999.000000 15437.000000 1.600000 4.000000 100.000000 12.000000 14.000000 2524.000000 93.000000 143.000000
25% 33780.000000 31187.000000 2.500000 5.000000 194.000000 17.000000 24.000000 3285.000000 102.000000 177.000000
50% 40590.000000 37575.000000 3.000000 6.000000 225.000000 19.000000 26.000000 3585.000000 107.000000 182.000000
75% 56595.000000 51815.000000 4.200000 8.000000 302.000000 20.000000 29.000000 3966.000000 110.000000 190.000000
max 192465.000000 173560.000000 6.000000 12.000000 493.000000 38.000000 46.000000 5423.000000 123.000000 204.000000
USA count 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000
mean 28377.442177 25949.340136 3.642857 6.108844 212.823129 19.074830 26.013605 3769.952381 112.020408 193.952381
std 11711.982506 10518.722194 1.194221 1.531090 63.748618 3.982992 5.396582 855.305524 8.788590 15.305265
min 10995.000000 10319.000000 1.600000 4.000000 103.000000 10.000000 12.000000 2348.000000 93.000000 150.000000
25% 20310.000000 18881.000000 2.700000 4.000000 155.000000 17.000000 22.000000 3222.000000 106.000000 185.000000
50% 25520.000000 23217.000000 3.600000 6.000000 200.000000 18.000000 26.000000 3606.000000 111.000000 194.000000
75% 33995.000000 30846.000000 4.600000 8.000000 250.000000 21.000000 29.000000 4275.000000 115.000000 202.000000
max 81795.000000 74451.000000 8.300000 10.000000 500.000000 29.000000 37.000000 7190.000000 144.000000 238.000000

5.4 Filtering Data

Sometimes you want to filter your data so that only part of the data set is analyzed. There are a few ways to do this in SWAT.

  1. Use the where= parameter on the table= parameter
  2. Use the query(...) method of CASTable
  3. Use Pandas-style expression indexing on CASTable objects

Essentially these are all equivalent. They all end up setting a where= parameter on the table= parameter, so it's more personal preference as to which one you choose.

The first method is the brute-force method: setting the where= parameter on the table. One thing to keep in mind is that the expression must be a CAS WHERE clause, not a Python expression. This may trip people up because = is used to test equality in CAS, whereas == is used in Python.


In [60]:
tbl.set_param(where='Origin = "Asia"')
tbl


Out[60]:
CASTable('_T_8B910465_7F1F62C95F88', caslib='CASUSER(viyauser)', where='Origin = "Asia"')

In [61]:
tbl.head()


Out[61]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

Using the query method on CASTable is equivalent. The only difference is that it creates a copy of the CASTable object and returns the copy with the where= parameter set. Since tbl already had a where= parameter set, the conditions get combined.


In [62]:
tbl = tbl.query('Cylinders = 6')
tbl


Out[62]:
CASTable('_T_8B910465_7F1F62C95F88', caslib='CASUSER(viyauser)', where='(Origin = "Asia") and (Cylinders = 6)')

In [63]:
tbl.head()


Out[63]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
2 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
3 Acura 3.5 RL w/Navigation 4dr Sedan Asia Front 46100.0 41100.0 3.5 6.0 225.0 18.0 24.0 3893.0 115.0 197.0
4 Acura NSX coupe 2dr manual S Sports Asia Rear 89765.0 79978.0 3.2 6.0 290.0 17.0 24.0 3153.0 100.0 174.0

Finally, you can use indexing expressions just like in Pandas DataFrames. This style of indexing creates a computed column for the expression, then adds the result of that computed column as a condition of the where= parameter.


In [64]:
tbl = tbl[tbl.MSRP > 50000]
tbl


Out[64]:
CASTable('_T_8B910465_7F1F62C95F88', caslib='CASUSER(viyauser)', computedvars=['_gt_0_'], computedvarsprogram='_gt_0_ = (MSRP > 50000); ', where='((Origin = "Asia") and (Cylinders = 6)) and (_gt_0_)')[['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice', 'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway', 'Weight', 'Wheelbase', 'Length']]

You may notice that the string representation of the table now includes all of the column names. This is due to the fact there there is a computed column in the table now which automatically get appended to the table, so in order for that not to show up in the output, we must select the columns that we want to display.


In [65]:
tbl.head()


Out[65]:
Selected Rows from Table _T_8B910465_7F1F62C95F88
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura NSX coupe 2dr manual S Sports Asia Rear 89765.0 79978.0 3.2 6.0 290.0 17.0 24.0 3153.0 100.0 174.0

Exercise 6 - Doing Simple Statistics

There are various CAS actions for doing some basic statistical analyses. These are located in the simple action set.


In [66]:
conn.tableinfo()


Out[66]:
§ TableInfo
Name Rows Columns Encoding CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime ModTime Global Repeated View SourceName SourceCaslib Compressed Creator Modifier
0 _T_8B910465_7F1F62C95F88 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 _T_8B910465_7F1F62C95F88 0 viyauser
1 CARS 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser
2 DATA.CARS 428 15 utf-8 09Sep2016:13:45:34 09Sep2016:13:45:34 UTF8 1.789048e+09 1.789048e+09 0 0 0 Data/cars.csv CASUSER(viyauser) 0 viyauser

elapsed 0.000718s · user 0.001s · mem 0.0709MB

Let's create a fresh CASTable object.


In [67]:
tbl = conn.CASTable('cars')

In [68]:
conn.simple?

6.1 Distinct Values

Let's start off with the distinct action. It takes the same table= parameter as the actions above, so we can call it directly on the CASTable object and have the table= parameter filled automatically.

distinct action outputs number of distinct values and number of missing values for each column in the table.


In [69]:
tbl.distinct()


Out[69]:
§ Distinct
Distinct Counts for CARS
Column NDistinct NMiss Trunc
0 Make 38.0 0.0 0.0
1 Model 425.0 0.0 0.0
2 Type 6.0 0.0 0.0
3 Origin 3.0 0.0 0.0
4 DriveTrain 3.0 0.0 0.0
5 MSRP 410.0 0.0 0.0
6 Invoice 425.0 0.0 0.0
7 EngineSize 43.0 0.0 0.0
8 Cylinders 8.0 2.0 0.0
9 Horsepower 110.0 0.0 0.0
10 MPG_City 28.0 0.0 0.0
11 MPG_Highway 33.0 0.0 0.0
12 Weight 348.0 0.0 0.0
13 Wheelbase 40.0 0.0 0.0
14 Length 67.0 0.0 0.0

elapsed 0.00605s · user 0.008s · mem 2.33MB

6.2 Univariate Summary Statistics

The summary action outputs univariate summary statistics.


In [70]:
tbl.summary()


Out[70]:
§ Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127
1 Invoice 9875.0 173560.0 428.0 0.0 30014.700935 12846292.0 17642.117750 852.763949 3.112443e+08 5.184789e+11 1.329013e+11 58.778256 35.196963 2.684398e-128
2 EngineSize 1.3 8.3 428.0 0.0 3.196729 1368.2 1.108595 0.053586 1.228982e+00 4.898540e+03 5.247754e+02 34.679034 59.656105 3.133745e-209
3 Cylinders 3.0 12.0 426.0 2.0 5.807512 2474.0 1.558443 0.075507 2.428743e+00 1.540000e+04 1.032216e+03 26.834946 76.913766 1.515569e-251
4 Horsepower 73.0 500.0 428.0 0.0 215.885514 92399.0 71.836032 3.472326 5.160415e+03 2.215110e+07 2.203497e+06 33.275059 62.173176 4.185344e-216
5 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257
6 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292
7 Weight 1850.0 7190.0 428.0 0.0 3577.953271 1531364.0 758.983215 36.686838 5.760555e+05 5.725125e+09 2.459757e+08 21.212776 97.526890 5.812547e-294
8 Wheelbase 89.0 144.0 428.0 0.0 108.154206 46290.0 8.311813 0.401767 6.908624e+01 5.035958e+06 2.949982e+04 7.685150 269.196577 0.000000e+00
9 Length 143.0 238.0 428.0 0.0 186.362150 79763.0 14.357991 0.694020 2.061519e+02 1.495283e+07 8.802687e+04 7.704349 268.525733 0.000000e+00

elapsed 0.0026s · user 0.001s · sys 0.002s · mem 0.668MB

You have two ways to specifiy the columns to generate summary statistics.


In [71]:
tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()


Out[71]:
§ Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127
1 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257
2 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292

elapsed 0.00307s · user 0.001s · sys 0.001s · mem 0.667MB


In [72]:
tbl.summary(inputs=['MSRP','MPG_CITY','MPG_Highway'])


Out[72]:
§ Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127
1 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257
2 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292

elapsed 0.00363s · user 0.002s · sys 0.000999s · mem 0.667MB

6.3 One-way Frequency Table

the freq action is similar to PROC FREQ


In [73]:
tbl[['Origin','Cylinders']].freq()


Out[73]:
§ Frequency
Frequency for CARS
Column NumVar CharVar FmtVar Level Frequency
0 Origin NaN Asia Asia 1 158.0
1 Origin NaN Europe Europe 2 123.0
2 Origin NaN USA USA 3 147.0
3 Cylinders NaN . 1 2.0
4 Cylinders 3.0 3 2 1.0
5 Cylinders 4.0 4 3 136.0
6 Cylinders 5.0 5 4 7.0
7 Cylinders 6.0 6 5 190.0
8 Cylinders 8.0 8 6 87.0
9 Cylinders 10.0 10 7 2.0
10 Cylinders 12.0 12 8 3.0

elapsed 0.0047s · user 0.001s · sys 0.001s · mem 0.694MB


In [74]:
import cufflinks
cufflinks.go_offline()

output = tbl[['Origin','Cylinders']].freq()
df = output['Frequency']

df[df['Column'] == 'Cylinders'].iplot(kind='bar', x='FmtVar', y='Frequency')


6.4 Two-way Frequency Table and Chi-square Test

Unlike PROC FREQ, the freq action only generates one-way frequency tables. You need the crosstab action to generate two-way or multi-way frequency tables. It also computes Chi-square statistics and measure of associations.


In [75]:
tbl.crosstab(row="Cylinders", col='Origin')


Out[75]:
§ Crosstab
Cylinders Col1 Col2 Col3
0 3 1.0 0.0 0.0
1 4 74.0 25.0 37.0
2 5 0.0 7.0 0.0
3 6 69.0 54.0 67.0
4 8 12.0 34.0 41.0
5 10 0.0 0.0 2.0
6 12 0.0 3.0 0.0

elapsed 0.00235s · user 0.002s · sys 0.001s · mem 0.64MB


In [76]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10)


Out[76]:
§ Crosstab
MSRP Col1 Col2 Col3
0 (-12500, 12500] 11.0 0.0 2.0
1 ( 12500, 37500] 134.0 47.0 117.0
2 ( 37500, 62500] 10.0 50.0 26.0
3 ( 62500, 87500] 2.0 20.0 2.0
4 ( 87500, 112500] 1.0 2.0 0.0
5 (112500, 137500] 0.0 3.0 0.0
6 (187500, 212500] 0.0 1.0 0.0

elapsed 0.00265s · user 0.003s · mem 0.639MB


In [77]:
tbl.crosstab(row="MSRP", col='Origin', rowNBins=10, chisq=True)


Out[77]:
§ Crosstab
MSRP Col1 Col2 Col3
0 (-12500, 12500] 11.0 0.0 2.0
1 ( 12500, 37500] 134.0 47.0 117.0
2 ( 37500, 62500] 10.0 50.0 26.0
3 ( 62500, 87500] 2.0 20.0 2.0
4 ( 87500, 112500] 1.0 2.0 0.0
5 (112500, 137500] 0.0 3.0 0.0
6 (187500, 212500] 0.0 1.0 0.0

§ ChiSq
Statistic DF Value Prob
0 Chi-Square 12.0 126.453870 3.172351e-21
1 Likelihood Ratio Chi-Square 12.0 127.634505 1.840164e-21

elapsed 0.0024s · user 0.002s · sys 0.001s · mem 0.639MB

6.5 Pearson's Correlation Coefficient

In Statistics, the Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. Its range is between +1 and −1, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient


In [78]:
tbl.correlation(simple=False)


Out[78]:
§ Correlation
Pearson Correlation Coefficients for CARS
Variable MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 MSRP 1.000000 0.999132 0.571753 0.649742 0.826945 -0.475020 -0.439622 0.448426 0.152000 0.172037
1 Invoice 0.999132 1.000000 0.564498 0.645226 0.823746 -0.470442 -0.434585 0.442332 0.148328 0.166586
2 EngineSize 0.571753 0.564498 1.000000 0.908002 0.787435 -0.709471 -0.717302 0.807867 0.636517 0.637448
3 Cylinders 0.649742 0.645226 0.908002 1.000000 0.810341 -0.684402 -0.676100 0.742209 0.546730 0.547783
4 Horsepower 0.826945 0.823746 0.787435 0.810341 1.000000 -0.676699 -0.647195 0.630796 0.387398 0.381554
5 MPG_City -0.475020 -0.470442 -0.709471 -0.684402 -0.676699 1.000000 0.941021 -0.737966 -0.507284 -0.501526
6 MPG_Highway -0.439622 -0.434585 -0.717302 -0.676100 -0.647195 0.941021 1.000000 -0.790989 -0.524661 -0.466092
7 Weight 0.448426 0.442332 0.807867 0.742209 0.630796 -0.737966 -0.790989 1.000000 0.760703 0.690021
8 Wheelbase 0.152000 0.148328 0.636517 0.546730 0.387398 -0.507284 -0.524661 0.760703 1.000000 0.889195
9 Length 0.172037 0.166586 0.637448 0.547783 0.381554 -0.501526 -0.466092 0.690021 0.889195 1.000000

elapsed 0.00279s · user 0.001s · sys 0.001s · mem 0.66MB


In [79]:
output = tbl.correlation(simple=False)['Correlation']

output.set_index(output['Variable']).iplot(kind='heatmap', colorscale='rdbu')


6.6 Histogram

Let's look at some output from the histogram action.


In [80]:
tbl[['MPG_City']].histogram()


Out[80]:
§ BinDetails
Bin Details for CARS
Variable BinSetId BinId BinLowerBnd BinUpperBnd BinWidth NInBin Mean Std Min Max MidPoint Percent
0 MPG_City 1 1 7.5 12.500000 5.000000 6 11.333333 1.032796 10.0 12.0 10.0 1.401869
1 MPG_City 1 2 12.5 17.500000 5.000000 114 15.666667 1.348002 13.0 17.0 15.0 26.635514
2 MPG_City 1 3 17.5 22.500000 5.000000 219 19.538813 1.314039 18.0 22.0 20.0 51.168224
3 MPG_City 1 4 22.5 27.500000 5.000000 61 24.803279 1.107846 23.0 27.0 25.0 14.252336
4 MPG_City 1 5 27.5 32.500000 5.000000 20 29.900000 1.713722 28.0 32.0 30.0 4.672897
5 MPG_City 1 6 32.5 37.500000 5.000000 4 34.750000 1.258306 33.0 36.0 35.0 0.934579
6 MPG_City 1 7 37.5 42.500000 5.000000 1 38.000000 NaN 38.0 38.0 40.0 0.233645
7 MPG_City 1 8 42.5 47.500000 5.000000 1 46.000000 NaN 46.0 46.0 45.0 0.233645
8 MPG_City 1 9 47.5 52.500000 5.000000 0 NaN NaN NaN NaN 50.0 0.000000
9 MPG_City 1 10 52.5 57.500000 5.000000 0 NaN NaN NaN NaN 55.0 0.000000
10 MPG_City 1 11 57.5 62.500001 5.000001 2 59.500000 0.707107 59.0 60.0 60.0 0.467290

elapsed 0.307s · user 0.002s · sys 0.004s · mem 0.704MB

While the table of data is nice and may be useful in some contexts, it might be nice to see a visual representation of the histogram data. There are various ways of plotting data in Python. The most basic is to use the plot method on the DataFrame.


In [81]:
# Set matplotlib to render graphics inline
%matplotlib inline

hist = tbl[['MPG_City']].histogram()

hist['BinDetails'].plot.bar(x='MidPoint', y='Percent')


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa2602c3470>

Exercise 7 - Grouping Data

7.1 Group-by

Now that we've done some basic statistics, let's look at how to group data. By grouping is handled by a parameter on the table: groupby. The brute force way of adding grouping to a table is to set it manually. The code below sets the groupby parameter to 'Origin'.


In [82]:
tbl.set_param(groupby='Origin')
tbl


Out[82]:
CASTable('cars', groupby='Origin')

We can then run the same summary action call as before and we will get the results grouped by Origin.


In [83]:
output = tbl[['MSRP','MPG_CITY','MPG_Highway']].summary()
output


Out[83]:
§ ByGroupInfo
ByGroupInfo
Origin Origin_f _key_
0 Asia Asia Asia
1 Europe Europe Europe
2 USA USA USA

§ ByGroup1.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
Asia MSRP 10280.0 89765.0 158.0 0.0 24741.322785 3909129.0 11321.069675 900.655944 1.281666e+08 1.168392e+11 2.012216e+10 45.757738 27.470338 7.525437e-62
Asia MPG_City 13.0 60.0 158.0 0.0 22.012658 3478.0 6.733307 0.535673 4.533742e+01 8.367800e+04 7.117975e+03 30.588340 41.093453 6.417174e-86
Asia MPG_Highway 17.0 66.0 158.0 0.0 28.265823 4466.0 6.770503 0.538632 4.583972e+01 1.334320e+05 7.196835e+03 23.952968 52.477026 1.853048e-101

§ ByGroup2.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
Europe MSRP 16999.0 192465.0 123.0 0.0 48349.796748 5947025.0 25318.600464 2282.901323 6.410315e+08 3.657433e+11 7.820585e+10 52.365474 21.179101 1.104072e-42
Europe MPG_City 12.0 38.0 123.0 0.0 18.731707 2304.0 3.289509 0.296605 1.082087e+01 4.447800e+04 1.320146e+03 17.561183 63.153699 4.846098e-95
Europe MPG_Highway 14.0 46.0 123.0 0.0 26.008130 3199.0 4.167588 0.375779 1.736879e+01 8.531900e+04 2.118992e+03 16.024172 69.211292 9.190472e-100

§ ByGroup3.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
USA MSRP 10995.0 81795.0 147.0 0.0 28377.442177 4171484.0 11711.982506 965.988036 1.371705e+08 1.384029e+11 2.002690e+10 41.272157 29.376598 3.686853e-63
USA MPG_City 10.0 29.0 147.0 0.0 19.074830 2804.0 3.982992 0.328512 1.586423e+01 5.580200e+04 2.316177e+03 20.880878 58.064396 8.933082e-103
USA MPG_Highway 12.0 37.0 147.0 0.0 26.013605 3824.0 5.396582 0.445103 2.912310e+01 1.037280e+05 4.251973e+03 20.745230 58.444064 3.586575e-103

elapsed 0.00266s · user 0.001s · sys 0.001s · mem 0.672MB

You'll notice that we get each By group in a separate result key. This is done to make the output as flexible as possible. When dealing with large sets of data, it is possible that you could end up with very large result sets coming back from CAS. The By groups are split so that it is possible to handle them piecemeal rather than as one big table. We will not go into detail on the way to construct action calls to handle results individually, but you can go to the SWAT documentation for more information.

In cases like this where the data is fairly small, you can concatenate all of the By group tables using the concat_bygroups method on the CASResults object. This gives you results similar to the non-By group code, but adds an index to the DataFrame that contains the By values.


In [84]:
output.concat_bygroups()


Out[84]:
§ Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
Asia MSRP 10280.0 89765.0 158.0 0.0 24741.322785 3909129.0 11321.069675 900.655944 1.281666e+08 1.168392e+11 2.012216e+10 45.757738 27.470338 7.525437e-62
Asia MPG_City 13.0 60.0 158.0 0.0 22.012658 3478.0 6.733307 0.535673 4.533742e+01 8.367800e+04 7.117975e+03 30.588340 41.093453 6.417174e-86
Asia MPG_Highway 17.0 66.0 158.0 0.0 28.265823 4466.0 6.770503 0.538632 4.583972e+01 1.334320e+05 7.196835e+03 23.952968 52.477026 1.853048e-101
Europe MSRP 16999.0 192465.0 123.0 0.0 48349.796748 5947025.0 25318.600464 2282.901323 6.410315e+08 3.657433e+11 7.820585e+10 52.365474 21.179101 1.104072e-42
Europe MPG_City 12.0 38.0 123.0 0.0 18.731707 2304.0 3.289509 0.296605 1.082087e+01 4.447800e+04 1.320146e+03 17.561183 63.153699 4.846098e-95
Europe MPG_Highway 14.0 46.0 123.0 0.0 26.008130 3199.0 4.167588 0.375779 1.736879e+01 8.531900e+04 2.118992e+03 16.024172 69.211292 9.190472e-100
USA MSRP 10995.0 81795.0 147.0 0.0 28377.442177 4171484.0 11711.982506 965.988036 1.371705e+08 1.384029e+11 2.002690e+10 41.272157 29.376598 3.686853e-63
USA MPG_City 10.0 29.0 147.0 0.0 19.074830 2804.0 3.982992 0.328512 1.586423e+01 5.580200e+04 2.316177e+03 20.880878 58.064396 8.933082e-103
USA MPG_Highway 12.0 37.0 147.0 0.0 26.013605 3824.0 5.396582 0.445103 2.912310e+01 1.037280e+05 4.251973e+03 20.745230 58.444064 3.586575e-103

Because SAS uses formatted values to do By grouping and Pandas typically works with raw values, there is a method called reshape_bygroups on the SASDataFrame object that allows you to change the representation of By groups in your DataFrames. You can indicate that you want By values (raw or formatted) as columns, an index, or neither. You can also specify a suffix to use when name collisions occur in column names if you have an action that can utilize the same column multiple times in the output.

7.2 Using Pandas-style Grouping

When we started this section, we mentioned that using the set_param method was the brute force way of setting By groupings. You can also use the Pandas-style way of setting groupings: the groupby method.

Using this returns a CASTable-like object with the groupby parameter set. All action calls on the new object behave just like a CASTable running the same action.


In [85]:
tbl.del_param('groupby')
tbl


Out[85]:
CASTable('cars')

In [86]:
output = tbl[['MSRP','MPG_City']].groupby('Origin').summary()
output


Out[86]:
§ ByGroupInfo
ByGroupInfo
Origin Origin_f _key_
0 Asia Asia Asia
1 Europe Europe Europe
2 USA USA USA

§ ByGroup1.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
Asia MSRP 10280.0 89765.0 158.0 0.0 24741.322785 3909129.0 11321.069675 900.655944 1.281666e+08 1.168392e+11 2.012216e+10 45.757738 27.470338 7.525437e-62
Asia MPG_City 13.0 60.0 158.0 0.0 22.012658 3478.0 6.733307 0.535673 4.533742e+01 8.367800e+04 7.117975e+03 30.588340 41.093453 6.417174e-86

§ ByGroup2.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
Europe MSRP 16999.0 192465.0 123.0 0.0 48349.796748 5947025.0 25318.600464 2282.901323 6.410315e+08 3.657433e+11 7.820585e+10 52.365474 21.179101 1.104072e-42
Europe MPG_City 12.0 38.0 123.0 0.0 18.731707 2304.0 3.289509 0.296605 1.082087e+01 4.447800e+04 1.320146e+03 17.561183 63.153699 4.846098e-95

§ ByGroup3.Summary
Descriptive Statistics for CARS
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Origin
USA MSRP 10995.0 81795.0 147.0 0.0 28377.442177 4171484.0 11711.982506 965.988036 1.371705e+08 1.384029e+11 2.002690e+10 41.272157 29.376598 3.686853e-63
USA MPG_City 10.0 29.0 147.0 0.0 19.074830 2804.0 3.982992 0.328512 1.586423e+01 5.580200e+04 2.316177e+03 20.880878 58.064396 8.933082e-103

elapsed 0.00226s · user 0.003s · mem 0.672MB

Exercise 8 - SQL

SQL stands for Structured Query Language. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. In this section we illustrate how to use SQL to communicate with CAS server.


In [87]:
conn.loadactionset('fedsql')


NOTE: Added action set 'fedsql'.
Out[87]:
§ actionset
fedsql

elapsed 0.0212s · sys 0.003s · mem 0.0781MB

8.1 Execute an SQL query

ExecDirect is a CAS action in the FedSQL action set to execute SQL statements in the CAS environment.


In [88]:
str1 = "select make, model, msrp, mpg_highway from cars where msrp > 80000 and mpg_highway > 20"
conn.fedsql.execDirect(query=str1)


Out[88]:
§ Result Set
Make Model MSRP MPG_Highway
0 Acura NSX coupe 2dr manual S 89765.0 24.0
1 Audi RS 6 4dr 84600.0 22.0
2 Jaguar XKR coupe 2dr 81995.0 23.0
3 Jaguar XKR convertible 2dr 86995.0 23.0
4 Mercedes-Benz CL500 2dr 94820.0 24.0
5 Mercedes-Benz S500 4dr 86970.0 24.0
6 Mercedes-Benz SL500 convertible 2dr 90520.0 23.0
7 Mercedes-Benz SL55 AMG 2dr 121770.0 21.0
8 Porsche 911 Carrera 4S coupe 2dr (convert) 84165.0 24.0
9 Porsche 911 GT2 2dr 192465.0 24.0

elapsed 0.126s · user 0.016s · sys 0.013s · mem 3.64MB

8.2 Group-by and summary statistics


In [89]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1)


Out[89]:
§ Result Set
Origin MEAN STD
0 Asia 28.265823 6.770503
1 Europe 26.008130 4.167588
2 USA 26.013605 5.396582

elapsed 0.0203s · user 0.021s · sys 0.005s · mem 10.2MB

Show the SQL logic


In [90]:
str1 = "select origin, mean(mpg_highway), std(mpg_highway) from cars group by origin"
conn.fedsql.execDirect(query=str1, method=True)


Methods for full query plan
----------------------------
        Agg 
          Sort 
            SeqScan from CASUSER(viyauser).CARS 
 
Methods for stage 1
--------------------
        Agg 
            SeqScan with _pushed_ order by from CASUSER(viyauser).CARS 
 
        Stage query: select "T1"."Origin", AVG ("T1"."MPG_Highway")  as "MEAN", STDDEV ("T1"."MPG_Highway")  as "STD" from "CASUSER(viyauser)"."CARS" {options tableID=1} T1 group by "T1"."Origin"
 
        Number of SQL I/O threads: 4
 
Out[90]:
§ Result Set
Origin MEAN STD
0 Asia 28.265823 6.770503
1 Europe 26.008130 4.167588
2 USA 26.013605 5.396582

elapsed 0.0155s · user 0.018s · sys 0.005s · mem 9.92MB

8.3 Create a new CAS table using SQL


In [91]:
str1 = """
    create table aggregated1 as 
        select origin, mean(mpg_highway), std(mpg_highway) 
        from cars group by origin
"""

conn.fedsql.execDirect(query=str1)


NOTE: Table AGGREGATED1 was created in caslib CASUSER(viyauser) with 3 rows returned.
Out[91]:

elapsed 0.0137s · user 0.022s · sys 0.005s · mem 10MB


In [92]:
conn.tableinfo()


Out[92]:
§ TableInfo
Name Rows Columns Encoding CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime ModTime Global Repeated View SourceName SourceCaslib Compressed Creator Modifier
0 _T_8B910465_7F1F62C95F88 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 _T_8B910465_7F1F62C95F88 0 viyauser
1 CARS 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser
2 DATA.CARS 428 15 utf-8 09Sep2016:13:45:34 09Sep2016:13:45:34 UTF8 1.789048e+09 1.789048e+09 0 0 0 Data/cars.csv CASUSER(viyauser) 0 viyauser
3 AGGREGATED1 3 3 utf-8 09Sep2016:13:45:43 09Sep2016:13:45:43 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser

elapsed 0.000587s · user 0.001s · mem 0.0718MB

You can delete a table usig the dropTable action.


In [93]:
# conn.dropTable('aggregated1')

Continue to work on the table created by SQL


In [94]:
tbl_temp = conn.CASTable('aggregated1')

In [95]:
tbl_temp.fetch()


Out[95]:
§ Fetch
Selected Rows from Table AGGREGATED1
Origin MEAN STD
0 Asia 28.265823 6.770503
1 Europe 26.008130 4.167588
2 USA 26.013605 5.396582

elapsed 0.00117s · user 0.001s · mem 0.556MB

8.4 Table Join


In [96]:
str1 = """
    create table aggregated2 as 
        select origin, type, mean(mpg_highway), std(mpg_highway) 
        from cars 
        group by origin, type
"""

conn.fedsql.execDirect(query=str1)


NOTE: Table AGGREGATED2 was created in caslib CASUSER(viyauser) with 15 rows returned.
Out[96]:

elapsed 0.0169s · user 0.03s · sys 0.002s · mem 10.5MB


In [97]:
str1 = """
    select a.origin, a.type, b.mean as Average_Origin, a.mean as Average_Type 
    from aggregated2 as a 
        left join aggregated1 as b 
    on b.origin = a.origin
"""

conn.fedsql.execDirect(query=str1)


Out[97]:
§ Result Set
Origin Type AVERAGE_ORIGIN AVERAGE_TYPE
0 Asia Hybrid 28.265823 56.000000
1 Asia Truck 28.265823 22.000000
2 Europe Sports 26.008130 25.130435
3 USA Sports 26.013605 24.222222
4 Asia SUV 28.265823 21.680000
5 Asia Wagon 28.265823 28.181818
6 Europe Wagon 26.008130 26.583333
7 USA Truck 26.013605 20.500000
8 Asia Sedan 28.265823 29.968085
9 Europe SUV 26.008130 18.700000
10 USA SUV 26.013605 20.040000
11 USA Wagon 26.013605 29.714286
12 Asia Sports 28.265823 26.647059
13 Europe Sedan 26.008130 27.115385
14 USA Sedan 26.013605 28.544444

elapsed 0.021s · user 0.017s · sys 0.018s · mem 10.2MB

Exercise 9 - DATA Step

You can submit SAS data step codes using the runCode action! Don't forget the new data set option REPLACE=YES if you want to replace an existing data set in the same CAS server. By default the new table CARS_TEMP is saved to the CASUSER library in the CAS server.


In [98]:
conn.runCode(code="""
    data cars_temp(replace=yes); 
        set cars; 
        sqrt_MSRP = sqrt(MSRP); 
        MPG_avg = (MPG_city+MPG_highway)/2; 
    run;
""")


Out[98]:
§ InputCasTables
casLib Name Rows Columns casTable
0 CASUSER(viyauser) cars 428 15 CASTable('cars', caslib='CASUSER(viyauser)')

§ OutputCasTables
casLib Name Rows Columns casTable
0 CASUSER(viyauser) cars_temp 428 17 CASTable('cars_temp', caslib='CASUSER(viyauser)')

elapsed 0.0466s · user 0.01s · sys 0.005s · mem 8.22MB


In [99]:
conn.tableinfo()


Out[99]:
§ TableInfo
Name Rows Columns Encoding CreateTimeFormatted ModTimeFormatted JavaCharSet CreateTime ModTime Global Repeated View SourceName SourceCaslib Compressed Creator Modifier
0 _T_8B910465_7F1F62C95F88 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 _T_8B910465_7F1F62C95F88 0 viyauser
1 CARS 428 15 utf-8 09Sep2016:13:45:32 09Sep2016:13:45:32 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser
2 DATA.CARS 428 15 utf-8 09Sep2016:13:45:34 09Sep2016:13:45:34 UTF8 1.789048e+09 1.789048e+09 0 0 0 Data/cars.csv CASUSER(viyauser) 0 viyauser
3 AGGREGATED1 3 3 utf-8 09Sep2016:13:45:43 09Sep2016:13:45:43 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser
4 AGGREGATED2 15 4 utf-8 09Sep2016:13:45:43 09Sep2016:13:45:43 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser
5 CARS_TEMP 428 17 utf-8 09Sep2016:13:45:43 09Sep2016:13:45:43 UTF8 1.789048e+09 1.789048e+09 0 0 0 0 viyauser

elapsed 0.000788s · sys 0.001s · mem 0.0752MB


In [100]:
tbl_cars_temp = conn.CASTable('cars_temp')
tbl_cars_temp[['sqrt_MSRP','MPG_avg']].summary()


Out[100]:
§ Summary
Descriptive Statistics for CARS_TEMP
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
0 sqrt_MSRP 101.390335 438.708331 428.0 0.0 175.121390 74951.954867 45.959648 2.221543 2112.289203 14027638.00 901947.489887 26.244451 78.828705 1.415239e-256
1 MPG_avg 11.000000 63.000000 428.0 0.0 23.452103 10037.500000 5.408331 0.261422 29.250042 247890.25 12489.768107 23.061177 89.709911 3.448353e-279

elapsed 0.00557s · sys 0.003s · mem 0.662MB

Exercise 10 - Building Analytical Models

You need to load an modeling action before you use it. Let's load the action sets for regression and decision tree models


In [101]:
conn.loadactionset('regression')
conn.loadactionset('decisiontree')

tbl = conn.CASTable('cars')


NOTE: Added action set 'regression'.
NOTE: Added action set 'decisiontree'.

The regression action set contains three models: linear regressions, logistic regressions, and generalized linear models.


In [102]:
conn.help(actionset='regression')


NOTE: Information for action set 'regression':
NOTE:    regression
NOTE:       glm - Fits linear regression models using the method of least squares
NOTE:       genmod - Fits generalized linear regression models
NOTE:       logistic - Fits logistic regression models
Out[102]:
§ regression
name description
0 glm Fits linear regression models using the method...
1 genmod Fits generalized linear regression models
2 logistic Fits logistic regression models

elapsed 0.00434s · user 0.001s · mem 0.0451MB

Let us build a linear regression model to predict the MSRP values of the automobiles using the CARS data set.


In [103]:
output1 = tbl.glm(
    target = 'MSRP',
    nominals = ['Type', 'Origin'],
    inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight']
)

In [104]:
output1


Out[104]:
§ ModelInfo
Model Information
RowId Description Value
0 DATA Data Source CARS
1 RESPONSEVAR Response Variable MSRP

§ NObs
Number of Observations
RowId Description Value
0 NREAD Number of Observations Read 428.0
1 NUSED Number of Observations Used 428.0

§ ClassInfo
Class Level Information
Class Levels Values
0 Type 6.0 Hybrid SUV Sedan Sports Truck Wagon
1 Origin 3.0 Asia Europe USA

§ Dimensions
Dimensions
RowId Description Value
0 NEFFECTS Number of Effects 6
1 NPARMS Number of Parameters 13

§ ANOVA
Analysis of Variance
RowId Source DF SS MS FValue ProbF
0 MODEL Model 10.0 9.788809e+10 9.788809e+09 64.441212 2.863778e-78
1 ERROR Error 417.0 6.334352e+10 1.519029e+08 NaN NaN
2 TOTAL Corrected Total 427.0 1.612316e+11 NaN NaN NaN

§ FitStatistics
Fit Statistics
RowId Description Value
0 RMSE Root MSE 1.232489e+04
1 RSQUARE R-Square 6.071272e-01
2 ADJRSQ Adj R-Sq 5.977057e-01
3 AIC AIC 8.503842e+03
4 AICC AICC 8.504594e+03
5 SBC SBC 8.118493e+03
6 TRAIN_ASE ASE 1.479989e+08

§ ParameterEstimates
Parameter Estimates
Effect Type Origin Parameter DF Estimate StdErr tValue Probt
0 Intercept Intercept 1 -24437.969421 16457.304807 -1.484931 1.383175e-01
1 Type Hybrid Type Hybrid 1 32712.213902 10759.213364 3.040391 2.511673e-03
2 Type SUV Type SUV 1 -3143.063706 3130.508120 -1.004011 3.159557e-01
3 Type Sedan Type Sedan 1 2292.189910 2406.138849 0.952642 3.413231e-01
4 Type Sports Type Sports 1 24150.756192 3062.016347 7.887207 2.737502e-14
5 Type Truck Type Truck 1 -11340.453498 3609.223220 -3.142076 1.797217e-03
6 Type Wagon Type Wagon 0 0.000000 NaN NaN NaN
7 Origin Asia Origin Asia 1 1903.248827 1503.771197 1.265651 2.063449e-01
8 Origin Europe Origin Europe 1 17995.412825 1787.406597 10.067890 1.734237e-21
9 Origin USA Origin USA 0 0.000000 NaN NaN NaN
10 MPG_City MPG_City 1 -742.336207 265.209506 -2.799056 5.363126e-03
11 Length Length 1 175.595793 82.233208 2.135339 3.331544e-02
12 Weight Weight 1 8.436445 1.921768 4.389939 1.438777e-05

§ Timing
Task Timing
RowId Task Time RelTime
0 SETUP Setup and Parsing 0.021932 0.459167
1 LEVELIZATION Levelization 0.003773 0.078990
2 INITIALIZATION Model Initialization 0.002705 0.056633
3 SSCP SSCP Computation 0.013514 0.282928
4 FITTING Model Fitting 0.004904 0.102670
5 CLEANUP Cleanup 0.000867 0.018149
6 TOTAL Total 0.047765 1.000000

elapsed 0.0493s · user 0.006s · sys 0.009s · mem 0.794MB

You can get the results in a ODS-like style too.


In [105]:
from swat.render import render_html
render_html(output1)


Model Information
RowId Description Value
DATAData SourceCARS
RESPONSEVARResponse VariableMSRP
Number of Observations
RowId Description Value
NREADNumber of Observations Read428
NUSEDNumber of Observations Used428
Class Level Information
Class Levels Values
Type6Hybrid SUV Sedan Sports Truck Wagon
Origin3Asia Europe USA
Dimensions
RowId Description Value
NEFFECTSNumber of Effects6
NPARMSNumber of Parameters13
Analysis of Variance
RowId Source DF Sum of Squares Mean Square F Value Pr > F
MODELModel10978880939899788809398.964.4412122272.863778E-78
ERRORError41763343524714151902936.96nannan
TOTALCorrected Total427161231618703nannannan
Fit Statistics
RowId Description Value
RMSERoot MSE12324.890951
RSQUARER-Square0.6071271552
ADJRSQAdj R-Sq0.597705744
AICAIC8503.8421606
AICCAICC8504.5939678
SBCSBC8118.4925157
TRAIN_ASEASE147998889.52
Parameter Estimates
Effect Type Origin Parameter DF Estimate StdErr tValue Probt
InterceptIntercept1-24437.9694216457.304807-1.4849314460.1383175144
TypeHybridType Hybrid132712.21390210759.2133643.04039085350.0025116732
TypeSUVType SUV1-3143.0637063130.5081203-1.0040107180.3159556852
TypeSedanType Sedan12292.18990972406.13884940.95264240890.3413231382
TypeSportsType Sports124150.7561923062.01634677.88720681342.737502E-14
TypeTruckType Truck1-11340.45353609.2232203-3.1420759550.001797217
TypeWagonType Wagon00nannannan
OriginAsiaOrigin Asia11903.2488271503.7711971.26565053970.2063449053
OriginEuropeOrigin Europe117995.4128251787.406596710.0678899021.734237E-21
OriginUSAOrigin USA00nannannan
MPG_CityMPG_City1-742.3362069265.20950581-2.7990558050.0053631256
LengthLength1175.5957930182.2332076472.13533921430.0333154353
WeightWeight18.43644522751.92176834054.38993870880.0000143878
Task Timing
RowId Task Time Relative Time
SETUPSetup and Parsing0.02193212510.4591671201
LEVELIZATIONLevelization0.0037729740.0789903215
INITIALIZATIONModel Initialization0.00270509720.05663344
SSCPSSCP Computation0.01351404190.2829276084
FITTINGModel Fitting0.00490403180.1026699477
CLEANUPCleanup0.000866890.0181490559
TOTALTotal0.04776501661

Next let us build a decision tree model to predict MSRP.


In [106]:
output1 = tbl.dtreetrain(
    target = 'MSRP',
    nominals = ['Type', 'Origin'],
    inputs = ['Type', 'Origin', 'MPG_City', 'Length', 'Weight'],
    casout = conn.CASTable("treeModel"),
    varImp = True
)

In [107]:
output1


Out[107]:
§ ModelInfo
Decision Tree for CARS
Descr Value
0 Number of Tree Nodes 4.700000e+01
1 Max Number of Branches 2.000000e+00
2 Number of Levels 6.000000e+00
3 Number of Leaves 2.400000e+01
4 Number of Bins 2.000000e+01
5 Minimum Size of Leaves 5.000000e+00
6 Maximum Size of Leaves 1.030000e+02
7 Number of Variables 5.000000e+00
8 Alpha for Cost-Complexity Pruning 0.000000e+00
9 Number of Observations Used 4.280000e+02
10 Maximum STD of Leaves 4.922019e+04
11 Minimum STD of Leaves 9.862942e+02
12 Mean Squared Error 9.431782e+07

§ DTreeVarImpInfo
Decision Tree for CARS
Variable Importance Std Count
0 MPG_City 4.796464e+10 8.995686e+09 6.0
1 Origin 4.186990e+10 0.000000e+00 1.0
2 Type 1.592548e+10 2.559214e+09 8.0
3 Length 1.181178e+10 2.312614e+09 4.0
4 Weight 3.291795e+09 7.979902e+08 4.0

§ OutputCasTables
casLib Name Rows Columns casTable
0 CASUSER(viyauser) treeModel 47 31 CASTable('treeModel', caslib='CASUSER(viyauser)')

elapsed 0.0218s · user 0.005s · sys 0.007s · mem 2.19MB

The treeModel1 CASTable stores the decision tree model. Each row contains the information for an individual split in the decision tree.


In [108]:
conn.CASTable("treeModel1").fetch()


ERROR: The file or path 'treeModel1' is not available in the file system.
ERROR: Table 'treeModel1' could not be loaded.
ERROR: Failure opening table 'treeModel1': A table could not be loaded.
ERROR: The action stopped due to errors.
Out[108]:

elapsed 6e-05s · user 0.486s · sys 0.276s

Last but not least, let us visualize the importance of the predictors using Cufflinks.


In [109]:
output1['DTreeVarImpInfo'].sort_values(by='Importance').iplot(kind='barh',x='Variable',y='Importance')